The below project is the result of scraping Illinois Department of Agriculture FY19 Apiary Annual Report. This report is in a PDF file format and does not lend itself to easy data extraction. Most extraction from PDFs is still done manually. The goal of this project is to extract data from page 5 of the report, manipulate the data, visualize the data, and export to an excel file.
Tabula: Will be used to extract the data from PDF format.
Pandas: Will be used to clean the data, manipulate the data, and export to an excel format.
Plotly: Will be used to create visualizations of the data.
Extracting the table form Page 5 of the annual report provided imperfect results. With a little of adjustment of the coordinates that Tabula uses to extract teh data did provide marginally better results. Ultimately, all of the information from the table was extracted with the addition of a couple of empty rows. We can see this result below.
| 0 | 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|---|
| 0 | NaN | Number of | Number | Number | NaN | NaN |
| 1 | NaN | Registered | of | of | Colonies/ | Colonies/ |
| 2 | Year | Beekeepers | Apiaries | Colonies | Beekeeper | Apiary |
| 3 | 2019 | 4,551 | 6,202 | 32,268 | 7.1 | 5.2 |
| 4 | 2018 | 4,308 | 6,000 | 30,017 | 7.0 | 5.0 |
In order to get a workable data frame we will be deleting the first 3 rows of the extracted table as well as the last 2 columns. We will recreate the column headers with simplified names that will ease our later work. The last 2 columns are unnecessary for our purposes as they are derived values.
Also, the extracted data frame is of the object data type with a ',' in the numerical values. We will be deleting the ',' from the numerical items and converting them to a numeric data type.
| year | num_keepers | num_apiaries | num_colonies | |
|---|---|---|---|---|
| 0 | 2019 | 4,551 | 6,202 | 32,268 |
| 1 | 2018 | 4,308 | 6,000 | 30,017 |
| 2 | 2017 | 4,009 | 5,652 | 28,864 |
| 3 | 2016 | 3,522 | 5,136 | 25,339 |
| 4 | 2015 | 3,114 | 4,517 | 24,690 |
| year | num_keepers | num_apiaries | num_colonies | |
|---|---|---|---|---|
| 0 | 2019 | 4551 | 6202 | 32268 |
| 1 | 2018 | 4308 | 6000 | 30017 |
| 2 | 2017 | 4009 | 5652 | 28864 |
| 3 | 2016 | 3522 | 5136 | 25339 |
| 4 | 2015 | 3114 | 4517 | 24690 |
We will be calculating the percent change from the previous year using Pandas .pct_change method. In order to do this we will need to first flip the data from so our year value is represented from minimum to maximum.
| year | num_keepers | num_apiaries | num_colonies | keepers_pct_chg | colonies_pct_chg | apiaries_pct_chg | |
|---|---|---|---|---|---|---|---|
| 0 | 2001 | 1160 | 2038 | 19627 | NaN | NaN | NaN |
| 1 | 2002 | 1107 | 1914 | 17963 | -0.045690 | -0.084781 | -0.060844 |
| 2 | 2003 | 1117 | 1926 | 18649 | 0.009033 | 0.038190 | 0.006270 |
| 3 | 2004 | 1141 | 1940 | 19572 | 0.021486 | 0.049493 | 0.007269 |
| 4 | 2005 | 1213 | 2054 | 27646 | 0.063103 | 0.412528 | 0.058763 |
Below we visualize our data using plotly express. We have created a line graph showing the change in total keepers, colonies, and apiaries from 2001 to 2019.
We also have visualized the percent change over time for our three variables. Here we can see a sharp decline in the number of colonies from 2005 to 2006 with a -27% change year over year. In that same time period the number of keepers grew by 4% and apiaries by 5%. This raises teh question of what caused such a sharp decline in the number of colonies during this time period?